Abstract

We propose a method for variable selection in multiple regression with random predictors.
The method is based on a criterion that reduces the variable selection problem to that of
estimating a suitable permutation and a dimensionality. Estimators of these two parameters
are then proposed, and the resulting variable selection method is shown to be consistent.
A simulation study illustrates the finite-sample performance of the proposed approach and
compares it with an existing method.
1. Introduction

The selection of variables and models is an old and important problem in statistics, and several approaches
have been proposed to deal with it for various methods of multivariate statistical analysis. For linear
regression, many model selection criteria have been proposed in the literature. Surveys of earlier work in
this field may be found in [6, 13, 14], and some monographs on this topic are available (e.g., [7, 8]). Most
of the methods that have been proposed for variable selection in linear regression deal with the case where
the covariates are assumed to be nonrandom; for this case, many selection criteria have been introduced.
These include the FPE criterion ([13, 14, 12, 15]), cross-validation ([16, 11]), AIC and Cp type
criteria (e.g., [4]), the prediction error criterion ([5]), and so on. Only a few works deal with the
case where the covariates are random, although its importance has been recognized in [2], where it is argued
that this case typically gives higher prediction errors than its fixed design counterpart and hence that more
is gained by variable selection. Linear regression with random design was considered in [17, 9] for variable
selection, but these works only deal with univariate models, that is, models for which the response is a real
random variable. A recent work that considers the multiple regression model is [1], in which a method based on
applying an adaptive LASSO type penalty and a novel BIC-type selection criterion is proposed in
order to select both predictors and responses.

In this paper we extend the approach introduced in [9] to the case of multiple regression. In Section
2, we present the multiple regression model that is used, together with a statement of the variable selection
problem. Then, the criterion is introduced and we give a characterization result that reduces
the variable selection problem to an estimation problem for two parameters. In Section 3, we propose our
method for selecting variables by estimating these two parameters, and we prove its consistency.
Section 4 is devoted to a simulation study that evaluates the finite-sample performance of the
proposal and compares it with the method given in [1]. Proofs of lemmas and theorems are given in
Section 5.


2. Model and criterion for selection 

In this section, the multiple regression model in which we are interested is introduced and a statement of the
corresponding variable selection problem is given. It is described as a problem of estimating a suitable set.
A criterion that characterizes this set is proposed, as well as an estimator of this criterion. Finally,
we give a result from which asymptotic properties of this estimator can be obtained.


2.1. Model and statement of the problem 

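We consider the multiple regression model given by:

$$Y = BX + \varepsilon \tag{1}$$

where $X$ and $Y$ are random vectors valued into $\mathbb{R}^p$ and $\mathbb{R}^q$ respectively, with $p \geq 2$ and $q \geq 2$, $B$ is a $q \times p$
matrix of real coefficients, and $\varepsilon$ is a random vector valued into $\mathbb{R}^q$ which is independent of $X$. Writing
$X = (X_1, \dots, X_p)^T$, $Y = (Y_1, \dots, Y_q)^T$, $\varepsilon = (\varepsilon_1, \dots, \varepsilon_q)^T$ and $B = (b_{ij})_{1 \leq i \leq q,\, 1 \leq j \leq p}$,
it is easily seen that Model (1) is equivalent to having a set of $q$ univariate regression models given by:

$$Y_i = \sum_{j=1}^{p} b_{ij} X_j + \varepsilon_i, \quad i = 1, \dots, q, \tag{2}$$

and can also be written as

$$Y = \sum_{j=1}^{p} X_j\, b_{\bullet j} + \varepsilon \tag{3}$$

where

$$b_{\bullet j} = \begin{pmatrix} b_{1j} \\ b_{2j} \\ \vdots \\ b_{qj} \end{pmatrix}.$$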


We are interested in the variable selection problem, that is, identifying the $X_j$'s which are not relevant
in the previous set of models, on the basis of an i.i.d. sample $(X^{(k)}, Y^{(k)})_{1 \leq k \leq n}$ of $(X, Y)$. We say that a
variable $X_j$ is not relevant if the corresponding coefficient vector $b_{\bullet j}$ is null. So, putting $I = \{1, \dots, p\}$, we
consider the subset $I_0 = \{ j \in I \;/\; \|b_{\bullet j}\|_{\mathbb{R}^q} = 0 \}$, which is assumed to be non-empty, and we tackle the variable
selection problem in Model (1) as a problem of estimating the set $I_0$ or, equivalently, the set $I_1 = I - I_0$. In
order to simplify the estimation of $I_1$, we will first characterize it by means of a criterion which is introduced
below.
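As an illustration of this definition, the sets $I_0$ and $I_1$ can be read off the columns of $B$. The following minimal sketch does so for the coefficient matrix used in the simulation study of Section 4; the variable names and the use of NumPy are choices made here for illustration and are not part of the method.

```python
import numpy as np

# Coefficient matrix B (q = 5 rows, p = 7 columns) taken from Section 4.
B = np.array([
    [3, 0, 0, 1.5, 0, 0, 2],
    [4, 0, 0, 2.5, 0, 0, -1],
    [5, 0, 0, 0.5, 0, 0, 3],
    [6, 0, 0, 3.0, 0, 0, 1],
    [7, 0, 0, 6.0, 0, 0, 4],
])

# Euclidean norm of each column b_{.j}; X_j is not relevant when this norm is zero.
col_norms = np.linalg.norm(B, axis=0)

p = B.shape[1]
I0 = [j + 1 for j in range(p) if col_norms[j] == 0.0]  # 1-based indices of non-relevant variables
I1 = [j + 1 for j in range(p) if col_norms[j] > 0.0]   # 1-based indices of relevant variables

print("I0 =", I0)
print("I1 =", I1)
```

In practice $B$ is unknown, which is why the rest of the paper characterizes $I_1$ through quantities that can be estimated from the sample.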

2.2. Characterization of $I_1$

Without loss of generality, we assume that $X$ and $Y$ are centered; thus, this is also the case for $\varepsilon$.
Furthermore, denoting by $\|\cdot\|_{\mathbb{R}^k}$ the usual Euclidean norm of $\mathbb{R}^k$, we assume that $\mathbb{E}(\|X\|^4_{\mathbb{R}^p}) < +\infty$ and
$\mathbb{E}(\|Y\|^4_{\mathbb{R}^q}) < +\infty$. Then, it is possible to define the covariance operators

$$V_1 = \mathbb{E}(X \otimes X) \quad \text{and} \quad V_{12} = \mathbb{E}(Y \otimes X), \tag{4}$$

where $\otimes$ denotes the tensor product of vectors defined as follows: when $E$ and $F$ are Euclidean spaces and
$(u, v)$ is a pair belonging to $E \times F$, the tensor product $u \otimes v$ is the linear map from $E$ to $F$ such that

$$\forall h \in E, \quad (u \otimes v)(h) = \langle u, h \rangle_E \, v,$$

where $\langle \cdot, \cdot \rangle_E$ denotes the inner product in $E$.

Remark 1. Throughout the paper, we essentially use covariance operators, but the translation into matrix
terms is straightforward, and more details can be found in [3]. In particular, when $u$ and $v$ are vectors in $\mathbb{R}^p$ and
$\mathbb{R}^q$ respectively, the matrix of the operator $u \otimes v$ relative to the canonical bases is $v u^T$, where $u^T$ is the
transpose of $u$. So, if matrix expressions are preferred to operators, one can identify the operators given in
(4) with the matrices $V_1 = \mathbb{E}(X X^T)$ and $V_{12} = \mathbb{E}(X Y^T)$.
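The identification in Remark 1 can be checked numerically. The short sketch below, in which the helper name `tensor` and the test vectors are hypothetical choices made for illustration, builds the matrix $v u^T$ and compares its action on a vector $h$ with the definition $(u \otimes v)(h) = \langle u, h \rangle v$.

```python
import numpy as np

def tensor(u, v):
    """Matrix of u (x) v, the map h -> <u, h> v, relative to the canonical bases."""
    return np.outer(v, u)  # v u^T, with shape (len(v), len(u))

u = np.array([1.0, 2.0, 3.0])   # vector of R^3 (plays the role of E)
v = np.array([4.0, 5.0])        # vector of R^2 (plays the role of F)
h = np.array([1.0, 0.0, -1.0])  # arbitrary test vector of R^3

lhs = tensor(u, v) @ h          # (u (x) v)(h) computed through the matrix v u^T
rhs = np.dot(u, h) * v          # <u, h> v computed from the definition

print(np.allclose(lhs, rhs))
```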

Throughout the paper, the operator $V_1$ is assumed to be invertible. For any subset $K$ of $I$, let $A_K$ be the
projector

$$x = (x_i)_{i \in I} \in \mathbb{R}^p \;\mapsto\; x_K = (x_i)_{i \in K} \in \mathbb{R}^{\mathrm{card}(K)}$$

and put $\Pi_K := A_K^* (A_K V_1 A_K^*)^{-1} A_K$, where $A^*$ denotes the adjoint operator of $A$. Then, we introduce the
criterion

$$\xi_K = \| V_{12} - V_1 \Pi_K V_{12} \| \tag{5}$$

where $\|\cdot\|$ denotes the usual operator norm given by $\|A\| = \sqrt{\mathrm{tr}(A^* A)}$. This criterion yields a
more explicit expression of $I_1$, as stated in the following lemma.

Lemma 1. We have $I_1 \subset K$ if, and only if, $\xi_K = 0$.

This lemma makes it possible to characterize the fact that an integer $i$ belongs to $I_0$. Indeed, since having $i \in I_0$ is
equivalent to having $I_1 \subset I - \{i\}$, we deduce from this lemma that one has $i \in I_0$ if, and only if, $\xi_{K_i} = 0$,
where $K_i = I - \{i\}$. Then $I_1$ consists of the elements $i$ of $I$ for which $\xi_{K_i}$ does not vanish. Now, let us consider
the unique permutation $\sigma$ of $I$ satisfying:

(i) $\xi_{K_{\sigma(1)}} \geq \xi_{K_{\sigma(2)}} \geq \cdots \geq \xi_{K_{\sigma(p)}}$;

(ii) $\xi_{K_{\sigma(i)}} = \xi_{K_{\sigma(j)}}$ and $i < j$ imply $\sigma(i) < \sigma(j)$.

Since $I_0$ is non-empty, there exists an integer $s \in I$, that we call the dimensionality, satisfying

$$\xi_{K_{\sigma(1)}} \geq \xi_{K_{\sigma(2)}} \geq \cdots \geq \xi_{K_{\sigma(s)}} > 0 = \xi_{K_{\sigma(s+1)}} = \cdots = \xi_{K_{\sigma(p)}}.$$

Therefore, we obviously have the following characterization of $I_1$.

Lemma 2. $I_1 = \{ \sigma(k) \;/\; 1 \leq k \leq s \}$.
This result shows that estimation of $I_1$ reduces to that of the two parameters $\sigma$ and $s$. So, our method for
selecting variables will be based on estimating these parameters; in the next subsection, an estimator of the
criterion is introduced. It will be the basis of the proposed procedure for variable selection.
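The characterization of this subsection can be illustrated at the population level. The sketch below is only an illustration under stated assumptions: the covariance matrix `V1`, the coefficient matrix `B` and the helper `xi` are hypothetical, and the matrix identity $V_{12} = V_1 B^T$ follows from Model (1) with $X$ centered and independent of the error. It checks Lemma 1 on two subsets and then recovers the permutation, the dimensionality and the set of relevant variables from the values of the criterion on the subsets $I - \{i\}$.

```python
import numpy as np

def xi(K, V1, V12):
    """xi_K = ||V12 - V1 Pi_K V12|| with the Frobenius norm; K holds 1-based indices."""
    idx = [k - 1 for k in K]
    Pi = np.zeros_like(V1)
    if idx:
        # Pi_K = A_K^* (A_K V1 A_K^*)^{-1} A_K: inverse of the K x K block, padded with zeros.
        Pi[np.ix_(idx, idx)] = np.linalg.inv(V1[np.ix_(idx, idx)])
    return np.linalg.norm(V12 - V1 @ Pi @ V12)

# Hypothetical population quantities (assumptions of this example).
p = 4
V1 = np.array([[0.5 ** abs(i - j) for j in range(p)] for i in range(p)])
B = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 0.0, -1.0, 0.0]])   # columns 2 and 4 are null, so I_1 = {1, 3}
V12 = V1 @ B.T                          # E(X Y^T) = V1 B^T under Model (1)

I = list(range(1, p + 1))
tol = 1e-10                             # threshold used as a numerical zero

# Lemma 1: xi_K vanishes exactly when K contains I_1.
xi_contains = xi([1, 3], V1, V12)
xi_misses = xi([1, 2], V1, V12)

# Criterion on K_i = I - {i}; values below tol are set to exactly 0.
crit = {}
for i in I:
    value = xi([k for k in I if k != i], V1, V12)
    crit[i] = value if value > tol else 0.0

# Permutation sigma: decreasing criterion, ties broken by increasing index.
sigma = sorted(I, key=lambda i: (-crit[i], i))
s = sum(1 for i in I if crit[i] > 0.0)  # dimensionality
I1 = sorted(sigma[:s])                  # Lemma 2

print(sigma, s, I1)
```

The tolerance is needed only because floating-point arithmetic returns values of order machine precision where the exact criterion is zero.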

2.3. Estimation of the criterion 

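Recalling that we have an i.i.d. sample $(X^{(k)}, Y^{(k)})_{1 \leq k \leq n}$ of $(X, Y)$, we consider the sample means

$$\bar{X}^{(n)} = n^{-1} \sum_{k=1}^{n} X^{(k)}, \qquad \bar{Y}^{(n)} = n^{-1} \sum_{k=1}^{n} Y^{(k)},$$

and the empirical covariance operators

$$\hat{V}_1^{(n)} = n^{-1} \sum_{k=1}^{n} (X^{(k)} - \bar{X}^{(n)}) \otimes (X^{(k)} - \bar{X}^{(n)})$$

and

$$\hat{V}_{12}^{(n)} = n^{-1} \sum_{k=1}^{n} (Y^{(k)} - \bar{Y}^{(n)}) \otimes (X^{(k)} - \bar{X}^{(n)}).$$

Then, for any $K \subset I$, an estimator of $\xi_K$ is given by

$$\hat{\xi}_K^{(n)} = \| \hat{V}_{12}^{(n)} - \hat{V}_1^{(n)} \hat{\Pi}_K^{(n)} \hat{V}_{12}^{(n)} \|,$$

where $\hat{\Pi}_K^{(n)} = A_K^* (A_K \hat{V}_1^{(n)} A_K^*)^{-1} A_K$. The result given below yields asymptotic properties of
this estimator. As usual, when $E$ and $F$ are Euclidean vector spaces, we denote by $\mathcal{L}(E, F)$ the vector space
of operators from $E$ to $F$. When $E = F$, we simply write $\mathcal{L}(E)$ instead of $\mathcal{L}(E, E)$. Each element $A$ of
$\mathcal{L}(\mathbb{R}^{p+q})$ can be written as

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$

where $A_{11} \in \mathcal{L}(\mathbb{R}^p)$, $A_{12} \in \mathcal{L}(\mathbb{R}^q, \mathbb{R}^p)$, $A_{21} \in \mathcal{L}(\mathbb{R}^p, \mathbb{R}^q)$ and $A_{22} \in \mathcal{L}(\mathbb{R}^q)$. Then we consider the projectors

$$P_1 : A \in \mathcal{L}(\mathbb{R}^{p+q}) \mapsto A_{11} \in \mathcal{L}(\mathbb{R}^p) \quad \text{and} \quad P_2 : A \in \mathcal{L}(\mathbb{R}^{p+q}) \mapsto A_{12} \in \mathcal{L}(\mathbb{R}^q, \mathbb{R}^p),$$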



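and we have:

Proposition 1. We have

$$\sqrt{n}\, \hat{\xi}_K^{(n)} = \| \hat{\Psi}_K^{(n)}(\hat{H}^{(n)}) + \sqrt{n}\, \delta_K \|,$$

where $\delta_K = V_{12} - V_1 \Pi_K V_{12}$, $(\hat{\Psi}_K^{(n)})_{n \in \mathbb{N}^*}$ is a sequence of random operators which converges almost surely,
as $n \to +\infty$, to the operator $\Psi_K$ of $\mathcal{L}(\mathcal{L}(\mathbb{R}^{p+q}), \mathcal{L}(\mathbb{R}^q, \mathbb{R}^p))$ given by:

$$\Psi_K(A) = P_2(A) - P_1(A) \Pi_K V_{12} + V_1 \Pi_K P_1(A) \Pi_K V_{12} - V_1 \Pi_K P_2(A),$$

and $(\hat{H}^{(n)})_{n \in \mathbb{N}^*}$ is a sequence of random variables valued into $\mathcal{L}(\mathbb{R}^{p+q})$ which converges in distribution to a
random variable $H$ having a normal distribution with mean $0$ and covariance operator given by:

$$\Gamma = \mathbb{E}\big( (Z \otimes Z - V) \,\widetilde{\otimes}\, (Z \otimes Z - V) \big),$$

$Z$ being the $\mathbb{R}^{p+q}$-valued random variable given by

$$Z = \begin{pmatrix} X \\ Y \end{pmatrix},$$

$V = \mathbb{E}(Z \otimes Z)$ its covariance operator, and $\widetilde{\otimes}$ the tensor product between elements of $\mathcal{L}(\mathbb{R}^{p+q})$ related to the inner product $\langle A, B \rangle = \mathrm{tr}(A^* B)$.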
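In matrix terms, the estimator of the criterion introduced in this subsection only requires the empirical covariance matrices of the sample. The following sketch is a possible implementation under assumptions: the function name `xi_hat`, the synthetic design and the noise level are hypothetical, and the Frobenius norm is used for the operator norm $\sqrt{\mathrm{tr}(A^*A)}$.

```python
import numpy as np

def xi_hat(K, X, Y):
    """Empirical criterion for a subset K (1-based indices) from samples X (n x p) and Y (n x q)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)            # centre with the sample means
    Yc = Y - Y.mean(axis=0)
    V1 = Xc.T @ Xc / n                 # matrix of the empirical operator V_1 (p x p)
    V12 = Xc.T @ Yc / n                # matrix of the empirical operator V_12 (p x q)
    idx = [k - 1 for k in K]
    Pi = np.zeros_like(V1)
    if idx:
        # Empirical Pi_K: inverse of the K x K block of V1, padded with zeros.
        Pi[np.ix_(idx, idx)] = np.linalg.inv(V1[np.ix_(idx, idx)])
    return np.linalg.norm(V12 - V1 @ Pi @ V12)

# Synthetic data (assumed design): p = 3, q = 2, relevant variables I_1 = {1, 3}.
rng = np.random.default_rng(0)
n = 5000
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 0.0, -1.0]])
X = rng.standard_normal((n, 3))
Y = X @ B.T + np.sqrt(0.5) * rng.standard_normal((n, 2))

small = xi_hat([1, 3], X, Y)   # K contains I_1: the population value is 0
large = xi_hat([1, 2], X, Y)   # K misses variable 3: the population value is positive

print(small, large)
```

On such data the first value should be close to zero and the second clearly positive, in line with Lemma 1.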
3. Selection of variables


Lemma 2 shows that estimation of $I_1$ reduces to that of $\sigma$ and $s$. In this section, estimators for these two
parameters are proposed and consistency properties are established for them.


3.1. Estimation of $\sigma$ and $s$

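Let us consider a sequence $(f_n)_{n \in \mathbb{N}^*}$ of functions from $I$ to $\mathbb{R}_+$ such that there exist a real $\alpha \in\, ]0, 1/2[$ and
a strictly decreasing function $f : I \to \mathbb{R}_+$ satisfying:

$$\forall i \in I, \quad \lim_{n \to +\infty} \big( n^{\alpha} f_n(i) \big) = f(i).$$

Then, recalling that $K_i = I - \{i\}$, we put

$$\hat{\phi}_i^{(n)} = \hat{\xi}_{K_i}^{(n)} + f_n(i) \quad (i \in I)$$

and we take as estimator of $\sigma$ the random permutation $\hat{\sigma}^{(n)}$ of $I$ such that

$$\hat{\phi}^{(n)}_{\hat{\sigma}^{(n)}(1)} \geq \hat{\phi}^{(n)}_{\hat{\sigma}^{(n)}(2)} \geq \cdots \geq \hat{\phi}^{(n)}_{\hat{\sigma}^{(n)}(p)}$$

and if $\hat{\phi}^{(n)}_{\hat{\sigma}^{(n)}(i)} = \hat{\phi}^{(n)}_{\hat{\sigma}^{(n)}(j)}$ with $i < j$, then $\hat{\sigma}^{(n)}(i) < \hat{\sigma}^{(n)}(j)$. Furthermore, we consider the random set
$\hat{J}_i^{(n)} = \{ \hat{\sigma}^{(n)}(j) ;\; 1 \leq j \leq i \}$ and the random variable

$$\hat{\psi}_i^{(n)} = \hat{\xi}^{(n)}_{\hat{J}_i^{(n)}} + g_n\big(\hat{\sigma}^{(n)}(i)\big) \quad (i \in I)$$

where $(g_n)_{n \in \mathbb{N}^*}$ is a sequence of functions from $I$ to $\mathbb{R}_+$ such that there exist a real $\beta \in\, ]0, 1[$ and a strictly
increasing function $g : I \to \mathbb{R}_+$ satisfying:

$$\forall i \in I, \quad \lim_{n \to +\infty} \big( n^{\beta} g_n(i) \big) = g(i).$$

Then, we take as estimator of $s$ the random variable

$$\hat{s}^{(n)} = \min \Big\{ i \in I \;/\; \hat{\psi}_i^{(n)} = \min_{j \in I} \hat{\psi}_j^{(n)} \Big\}.$$

The variable selection is achieved by taking the random set

$$\hat{I}_1^{(n)} = \big\{ \hat{\sigma}^{(n)}(i) ;\; 1 \leq i \leq \hat{s}^{(n)} \big\}$$

as estimator of $I_1$.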
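The whole procedure of this subsection can be sketched in a few lines. The code below is a minimal illustration under assumptions, not a reference implementation: the penalties $f_n(i) = n^{-\alpha}/i$ and $g_n(i) = n^{-\beta} i$ with $\alpha = \beta = 1/4$ are hypothetical choices satisfying the stated conditions, the function names are invented for the example, the penalized criteria are taken to be $\hat{\xi}^{(n)}_{K_i} + f_n(i)$ for the permutation and $\hat{\xi}^{(n)}_{\hat{J}_i} + g_n(\hat{\sigma}^{(n)}(i))$ for the dimensionality, and the toy design lists the relevant variables first.

```python
import numpy as np

def criterion(idx, V1, V12):
    """||V12 - V1 Pi_K V12|| (Frobenius norm) for a list idx of 0-based indices."""
    Pi = np.zeros_like(V1)
    if idx:
        Pi[np.ix_(idx, idx)] = np.linalg.inv(V1[np.ix_(idx, idx)])
    return np.linalg.norm(V12 - V1 @ Pi @ V12)

def select_variables(X, Y, alpha=0.25, beta=0.25):
    """Return the selected variables (1-based, sorted) from samples X (n x p) and Y (n x q)."""
    n, p = X.shape
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    V1, V12 = Xc.T @ Xc / n, Xc.T @ Yc / n
    # Assumed penalties: f_n(i) = n^(-alpha) / i and g_n(i) = n^(-beta) * i (i is 1-based).
    f = lambda i: n ** (-alpha) / i
    g = lambda i: n ** (-beta) * i
    # phi_i = xi_hat(K_i) + f_n(i), with K_i = I - {i}.
    phi = [criterion([j for j in range(p) if j != i], V1, V12) + f(i + 1) for i in range(p)]
    # sigma_hat: decreasing phi, ties broken by increasing index.
    sigma = sorted(range(p), key=lambda i: (-phi[i], i))
    # psi_i = xi_hat(J_i) + g_n(sigma_hat(i)), with J_i = {sigma_hat(1), ..., sigma_hat(i)}.
    psi = [criterion(sigma[:i + 1], V1, V12) + g(sigma[i] + 1) for i in range(p)]
    s_hat = int(np.argmin(psi)) + 1    # smallest index attaining the minimum
    return sorted(j + 1 for j in sigma[:s_hat])

# Toy data (assumed design): p = 4, q = 2, relevant variables I_1 = {1, 2}.
rng = np.random.default_rng(1)
n = 2000
B = np.array([[3.0, 1.5, 0.0, 0.0],
              [4.0, -2.0, 0.0, 0.0]])
X = rng.standard_normal((n, 4))
Y = X @ B.T + np.sqrt(0.5) * rng.standard_normal((n, 2))
selected = select_variables(X, Y)

print(selected)
```

The empirical covariance matrices are computed once and reused for every subset, since each evaluation of the criterion only needs a different block of the same matrices.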

3.2. Consistency 

The following theorem establishes consistency for the preceding estimators:

Theorem 2. We have:

(i) $\lim_{n \to +\infty} P\big( \hat{\sigma}^{(n)} = \sigma \big) = 1$;

(ii) $\hat{s}^{(n)}$ converges in probability to $s$, as $n \to +\infty$.

As a consequence of this theorem, we easily obtain $\lim_{n \to +\infty} P\big( \hat{I}_1^{(n)} = I_1 \big) = 1$, which shows the consistency
of our method for selecting variables in the model (1).


4. Simulations 
Sample size    Proposed method    ASCCA
50             0.00105            5.323e-6
100            0.00012            0.00052
500            1.009e-6           9.075e-6
800            2.602e-7           5.789e-7
1000           1.243e-7           1.308e-7
2000           1.436e-8           1.692e-8

Table 1: Average of prediction errors over 2000 replications


In this section, we report the results of a simulation study carried out in order to check the efficacy
of the proposed approach and to compare it with an existing method: the ASCCA method introduced
by An et al. (2013). This latter method is based on recasting the multivariate regression problem as a
classical CCA problem for which a least squares type formulation is constructed, and applying an
adaptive LASSO type penalty together with a BIC-type selection criterion (see [1] for more details). Our
simulated data is based on two independent data sets: training data and test data, each with sample size
$n = 50, 100, 500, 800, 1000, 2000$. The training data is used for selecting variables by using both our method,
with penalty terms /„ (i) = i“^and (f) = i, and the ASCCA method. The test data is used 

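for computing the prediction error given by

$$e = \frac{1}{n} \sum_{k=1}^{n} \| Y^{(k)} - \hat{Y}^{(k)} \|^2_{\mathbb{R}^q},$$

where $Y^{(k)}$ is an observed response and $\hat{Y}^{(k)}$ is the usual linear predictor of $Y^{(k)}$ computed by using the
variables selected at the previous step, that is, the predictor obtained by least squares, whose expression involves
$(\mathbb{X}^T \mathbb{X})^{-1}$, where $\mathbb{X}$ is a matrix with $n$ rows and $\hat{s}^{(n)}$
columns containing the observations of the $X_j$'s that have been selected in the previous step. Each data set
was generated as follows: $X^{(k)}$ is generated from a multivariate normal distribution in $\mathbb{R}^7$ with mean 0 and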

covariance cou(a|^\ xj^^) = for any 1 < < 7, and the corresponding response is generated 

according to (1) with 


$$B = \begin{pmatrix}
3 & 0 & 0 & 1.5 & 0 & 0 & 2 \\
4 & 0 & 0 & 2.5 & 0 & 0 & -1 \\
5 & 0 & 0 & 0.5 & 0 & 0 & 3 \\
6 & 0 & 0 & 3 & 0 & 0 & 1 \\
7 & 0 & 0 & 6 & 0 & 0 & 4
\end{pmatrix}$$


and the related error term having a multivariate normal distribution in $\mathbb{R}^5$ with mean 0 and covariance
matrix $0.5\, I_5$, where $I_5$ denotes the 5-dimensional identity matrix. The outputs of the numerical experiment
are the averages of the aforementioned prediction errors over 2000 independent replications. The results are
reported in Table 1. Our method gives the better results for $n \geq 100$ but was outperformed by the ASCCA
method for $n = 50$.
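The evaluation step of this study, fitting by least squares on the selected variables of the training set and then averaging the squared errors on the test set, can be sketched as follows. This is an illustration under assumptions only: the function names, the identity covariance of the predictors, the small coefficient matrix and the $1/n$ normalisation are hypothetical choices made for the example, and no intercept is fitted because the variables are centered.

```python
import numpy as np

def prediction_error(X_train, Y_train, X_test, Y_test, selected):
    """Mean squared Euclidean prediction error on the test set, using the selected columns (1-based)."""
    cols = [j - 1 for j in selected]
    Xs = X_train[:, cols]
    # Least squares coefficients: solve (Xs^T Xs) C = Xs^T Y_train.
    coef = np.linalg.solve(Xs.T @ Xs, Xs.T @ Y_train)
    residuals = Y_test - X_test[:, cols] @ coef
    return np.mean(np.sum(residuals ** 2, axis=1))

def simulate(n, B, rng, noise_var=0.5):
    """Draw n observations from Y = B X + error, with standard normal predictors (assumed)."""
    q, p = B.shape
    X = rng.standard_normal((n, p))
    Y = X @ B.T + np.sqrt(noise_var) * rng.standard_normal((n, q))
    return X, Y

rng = np.random.default_rng(2)
B = np.array([[3.0, 1.5, 0.0, 0.0],
              [4.0, -2.0, 0.0, 0.0]])   # hypothetical coefficients, I_1 = {1, 2}
X_train, Y_train = simulate(2000, B, rng)
X_test, Y_test = simulate(2000, B, rng)

err_good = prediction_error(X_train, Y_train, X_test, Y_test, selected=[1, 2])
err_bad = prediction_error(X_train, Y_train, X_test, Y_test, selected=[1])

print(err_good, err_bad)
```

Omitting a relevant variable inflates the error by roughly the squared norm of its coefficient vector, which is what makes the prediction error a usable measure of selection quality.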


5. Proofs 


5.1. Proof of Lemma 1 










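Denoting by $(\Omega, \mathcal{A}, P)$ the considered probability space, we consider the operators:

$$L_1 : x = (x_1, \dots, x_p)^T \in \mathbb{R}^p \mapsto \sum_{j=1}^{p} x_j X_j \in L^2(\Omega, \mathcal{A}, P) \quad \text{and} \quad L_2 : y = (y_1, \dots, y_q)^T \in \mathbb{R}^q \mapsto \sum_{i=1}^{q} y_i Y_i \in L^2(\Omega, \mathcal{A}, P),$$

whose adjoints are respectively given by:

$$L_1^* : Z \in L^2(\Omega, \mathcal{A}, P) \mapsto \mathbb{E}(ZX) \in \mathbb{R}^p \quad \text{and} \quad L_2^* : Z \in L^2(\Omega, \mathcal{A}, P) \mapsto \mathbb{E}(ZY) \in \mathbb{R}^q.$$

It is easy to verify that $L_1^* L_1 = V_1$ and $L_1^* L_2 = V_{12}$. Denoting by $R(A)$ the range of the operator $A$, and
from the fact that the orthogonal projector onto $R(A)$ is given by $\Pi_{R(A)} = A (A^* A)^{-1} A^*$, we clearly
have

$$\xi_K = \| L_1^* L_2 - L_1^* L_1 A_K^* (A_K L_1^* L_1 A_K^*)^{-1} A_K L_1^* L_2 \| = \| L_1^* L_2 - L_1^* \Pi_{R(L_1 A_K^*)} L_2 \| = \| L_1^* \Pi_{R(L_1 A_K^*)^{\perp}} L_2 \|, \tag{6}$$

where $E^{\perp}$ denotes the orthogonal space of the vector space $E$. For any vector $\alpha = (\alpha_1, \dots, \alpha_q)^T$ in $\mathbb{R}^q$, one
has

$$L_2(\alpha) = \sum_{i=1}^{q} \alpha_i Y_i = \sum_{i=1}^{q} \alpha_i \Big( \sum_{j=1}^{p} b_{ij} X_j + \varepsilon_i \Big) = \sum_{i=1}^{q} \sum_{j=1}^{p} \alpha_i b_{ij} X_j + \sum_{i=1}^{q} \alpha_i \varepsilon_i.$$

Since for any $u = (u_1, \dots, u_p)^T \in \mathbb{R}^p$ we have

$$\Big\langle L_1(u), \sum_{i=1}^{q} \alpha_i \varepsilon_i \Big\rangle = \mathbb{E}\Big( \sum_{j=1}^{p} u_j X_j \sum_{i=1}^{q} \alpha_i \varepsilon_i \Big) = \sum_{j=1}^{p} \sum_{i=1}^{q} u_j \alpha_i\, \mathbb{E}(X_j \varepsilon_i) = \sum_{j=1}^{p} \sum_{i=1}^{q} u_j \alpha_i\, \mathbb{E}(X_j)\, \mathbb{E}(\varepsilon_i) = 0,$$

it follows that $\sum_{i=1}^{q} \alpha_i \varepsilon_i \in R(L_1)^{\perp}$ and, from $R(L_1)^{\perp} \subset R(L_1 A_K^*)^{\perp}$, we obtain

$$L_1^* \Pi_{R(L_1 A_K^*)^{\perp}} \Big( \sum_{i=1}^{q} \alpha_i \varepsilon_i \Big) = L_1^* \Big( \sum_{i=1}^{q} \alpha_i \varepsilon_i \Big) = \sum_{i=1}^{q} \mathbb{E}(\alpha_i \varepsilon_i X) = \sum_{i=1}^{q} \alpha_i\, \mathbb{E}(\varepsilon_i)\, \mathbb{E}(X) = 0.$$

Thus,

$$L_1^* \Pi_{R(L_1 A_K^*)^{\perp}} L_2(\alpha) = L_1^* \Pi_{R(L_1 A_K^*)^{\perp}} \Big( \sum_{i=1}^{q} \sum_{j=1}^{p} \alpha_i b_{ij} X_j \Big) = \sum_{i=1}^{q} \alpha_i\, L_1^* \Pi_{R(L_1 A_K^*)^{\perp}} L_1(b_{i \bullet}) \tag{7}$$

where

$$b_{i \bullet} = \begin{pmatrix} b_{i1} \\ b_{i2} \\ \vdots \\ b_{ip} \end{pmatrix}.$$

If $\xi_K = 0$, then considering, for $i = 1, \dots, q$, the vector $\alpha = (0, \dots, 0, 1, 0, \dots, 0)^T$ of $\mathbb{R}^q$ whose coordinates
are null except the $i$-th one which equals 1, we deduce from (6) and (7) that $L_1^* \Pi_{R(L_1 A_K^*)^{\perp}} L_1(b_{i \bullet}) = 0$. Since, for
any operator $A$, $\ker(A^* A) = \ker(A)$, it follows that we have $\Pi_{R(L_1 A_K^*)^{\perp}} L_1(b_{i \bullet}) = 0$, that is

$$L_1(b_{i \bullet}) \in R(L_1 A_K^*). \tag{8}$$

Denoting by $|K|$ the cardinality of $K$ and putting $K = \{k_1, k_2, \dots, k_{|K|}\}$, we deduce from (8) that there
exists a vector $\beta = (\beta_1, \dots, \beta_{|K|})^T \in \mathbb{R}^{|K|}$ such that $L_1(b_{i \bullet}) = L_1 A_K^* \beta$, that is

$$\sum_{j=1}^{p} b_{ij} X_j = \sum_{\ell=1}^{|K|} \beta_\ell X_{k_\ell}$$

and, equivalently,

$$\sum_{\ell=1}^{|K|} (b_{i k_\ell} - \beta_\ell) X_{k_\ell} + \sum_{j \in I - K} b_{ij} X_j = 0. \tag{9}$$

Since $V_1$ is invertible we have $\ker(L_1) = \ker(L_1^* L_1) = \ker(V_1) = \{0\}$. Then, $X_1, \dots, X_p$ are linearly
independent and, therefore, (9) implies that, for all $j \in I - K$, $b_{ij} = 0$. This property holds for any
$i \in \{1, \dots, q\}$; we then deduce that $I - K \subset I_0$ and, equivalently, that $I_1 \subset K$. Reciprocally, we first have

$$L_1(b_{i \bullet}) = \sum_{j=1}^{p} b_{ij} X_j = \sum_{j \in K} b_{ij} X_j + \sum_{j \in I - K} b_{ij} X_j = \sum_{\ell=1}^{|K|} b_{i k_\ell} X_{k_\ell} + \sum_{j \in I - K} b_{ij} X_j.$$

If $I_1 \subset K$, then $I - K \subset I_0$ and, consequently, for all $j \in I - K$, $b_{ij} = 0$. Thus

$$\Pi_{R(L_1 A_K^*)^{\perp}} L_1(b_{i \bullet}) = \Pi_{R(L_1 A_K^*)^{\perp}} \Big( \sum_{\ell=1}^{|K|} b_{i k_\ell} X_{k_\ell} \Big) = \Pi_{R(L_1 A_K^*)^{\perp}} L_1 A_K^* (A_K b_{i \bullet}) = 0$$

because $L_1 A_K^* (A_K b_{i \bullet}) \in R(L_1 A_K^*)$. Then, from (7) and (6), we deduce that $\xi_K = 0$.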

5.2. Proof of Proposition 1 

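We have:

$$\sqrt{n}\, \hat{\xi}_K^{(n)} = \Big\| \sqrt{n} (\hat{V}_{12}^{(n)} - V_{12}) - \Big[ \sqrt{n} (\hat{V}_1^{(n)} - V_1) \hat{\Pi}_K^{(n)} + \sqrt{n}\, V_1 (\hat{\Pi}_K^{(n)} - \Pi_K) \Big] \hat{V}_{12}^{(n)} - V_1 \Pi_K \big( \sqrt{n} (\hat{V}_{12}^{(n)} - V_{12}) \big) + \sqrt{n}\, \delta_K \Big\|$$

and since

$$\begin{aligned}
\hat{\Pi}_K^{(n)} - \Pi_K &= A_K^* \Big( (A_K \hat{V}_1^{(n)} A_K^*)^{-1} - (A_K V_1 A_K^*)^{-1} \Big) A_K \\
&= A_K^* \Big( -(A_K \hat{V}_1^{(n)} A_K^*)^{-1} \big( A_K \hat{V}_1^{(n)} A_K^* - A_K V_1 A_K^* \big) (A_K V_1 A_K^*)^{-1} \Big) A_K \\
&= -\hat{\Pi}_K^{(n)} (\hat{V}_1^{(n)} - V_1) \Pi_K,
\end{aligned}$$

it follows:

$$\begin{aligned}
\sqrt{n}\, \hat{\xi}_K^{(n)} = \Big\| & \sqrt{n} (\hat{V}_{12}^{(n)} - V_{12}) - \sqrt{n} (\hat{V}_1^{(n)} - V_1) \hat{\Pi}_K^{(n)} \hat{V}_{12}^{(n)} \\
& + V_1 \hat{\Pi}_K^{(n)} \big( \sqrt{n} (\hat{V}_1^{(n)} - V_1) \big) \Pi_K \hat{V}_{12}^{(n)} \\
& - V_1 \Pi_K \big( \sqrt{n} (\hat{V}_{12}^{(n)} - V_{12}) \big) + \sqrt{n}\, \delta_K \Big\|.
\end{aligned}$$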
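The identity expressing the difference between the empirical and the population versions of $\Pi_K$ as minus the product of the empirical $\Pi_K$, the difference of the covariance operators and the population $\Pi_K$, which drives the expansion above, is purely algebraic and can be checked numerically. In the sketch below the matrices are arbitrary: a random symmetric positive definite matrix stands in for $V_1$ and a small symmetric perturbation of it stands in for its empirical counterpart; the helper name `Pi` is a hypothetical choice.

```python
import numpy as np

def Pi(K, V):
    """Matrix of A_K^* (A_K V A_K^*)^{-1} A_K for a list K of 0-based indices."""
    out = np.zeros_like(V)
    out[np.ix_(K, K)] = np.linalg.inv(V[np.ix_(K, K)])
    return out

rng = np.random.default_rng(3)
p = 5
M = rng.standard_normal((p, p))
V1 = M @ M.T + p * np.eye(p)       # arbitrary symmetric positive definite matrix
E = 0.1 * rng.standard_normal((p, p))
V1_hat = V1 + (E + E.T) / 2        # symmetric perturbation, in the role of the empirical operator
K = [0, 2, 3]

lhs = Pi(K, V1_hat) - Pi(K, V1)
rhs = -Pi(K, V1_hat) @ (V1_hat - V1) @ Pi(K, V1)
gap = np.max(np.abs(lhs - rhs))    # should be of the order of machine precision

print(gap)
```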

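Let us consider the $\mathbb{R}^{p+q}$-valued random vectors

$$Z = \begin{pmatrix} X \\ Y \end{pmatrix}, \qquad Z^{(k)} = \begin{pmatrix} X^{(k)} \\ Y^{(k)} \end{pmatrix}, \quad k = 1, \dots, n; \tag{10}$$

the covariance operator of $Z$ is given by $V = \mathbb{E}(Z \otimes Z)$ and can be written as

$$V = \begin{pmatrix} V_1 & V_{12} \\ V_{21} & V_2 \end{pmatrix} \tag{11}$$

where $V_2 = \mathbb{E}(Y \otimes Y)$ and $V_{21} = V_{12}^*$. Further, putting

$$\bar{Z}^{(n)} = n^{-1} \sum_{k=1}^{n} Z^{(k)}, \qquad \hat{V}^{(n)} = n^{-1} \sum_{k=1}^{n} (Z^{(k)} - \bar{Z}^{(n)}) \otimes (Z^{(k)} - \bar{Z}^{(n)})$$

and

$$\hat{V}_2^{(n)} = n^{-1} \sum_{k=1}^{n} (Y^{(k)} - \bar{Y}^{(n)}) \otimes (Y^{(k)} - \bar{Y}^{(n)}),$$

we can write

$$\hat{V}^{(n)} = \begin{pmatrix} \hat{V}_1^{(n)} & \hat{V}_{12}^{(n)} \\ \hat{V}_{21}^{(n)} & \hat{V}_2^{(n)} \end{pmatrix} \tag{12}$$

where $\hat{V}_{21}^{(n)} = n^{-1} \sum_{k=1}^{n} (X^{(k)} - \bar{X}^{(n)}) \otimes (Y^{(k)} - \bar{Y}^{(n)})$. Then we deduce from (10),
(11) and (12) that $\sqrt{n}\, \hat{\xi}_K^{(n)} = \| \hat{\Psi}_K^{(n)}(\hat{H}^{(n)}) + \sqrt{n}\, \delta_K \|$, where $\hat{H}^{(n)} = \sqrt{n} (\hat{V}^{(n)} - V)$ and $\hat{\Psi}_K^{(n)}$ is the random
operator from $\mathcal{L}(\mathbb{R}^{p+q})$ to $\mathcal{L}(\mathbb{R}^q, \mathbb{R}^p)$ defined by

$$\forall A \in \mathcal{L}(\mathbb{R}^{p+q}), \quad \hat{\Psi}_K^{(n)}(A) = P_2(A) - P_1(A) \hat{\Pi}_K^{(n)} \hat{V}_{12}^{(n)} + V_1 \hat{\Pi}_K^{(n)} P_1(A) \Pi_K \hat{V}_{12}^{(n)} - V_1 \Pi_K P_2(A).$$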

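Considering the usual operator norm $\|\cdot\|_\infty$ defined in $\mathcal{L}(E, F)$ by $\|A\|_\infty = \sup_{x \in E - \{0\}} \|Ax\|_F / \|x\|_E$ and
recalling that, for two operators $A$ and $B$, one has $\|AB\|_\infty \leq \|A\|_\infty \|B\|_\infty$, we obtain

$$\begin{aligned}
\| \hat{\Psi}_K^{(n)}(A) - \Psi_K(A) \|_\infty
&= \Big\| -P_1(A) \Big[ (\hat{\Pi}_K^{(n)} - \Pi_K) \hat{V}_{12}^{(n)} + \Pi_K (\hat{V}_{12}^{(n)} - V_{12}) \Big] \\
&\qquad + V_1 (\hat{\Pi}_K^{(n)} - \Pi_K) P_1(A) \Pi_K \hat{V}_{12}^{(n)} + V_1 \Pi_K P_1(A) \Pi_K (\hat{V}_{12}^{(n)} - V_{12}) \Big\|_\infty \\
&\leq \|P_1(A)\|_\infty \Big[ \|\hat{\Pi}_K^{(n)} - \Pi_K\|_\infty \|\hat{V}_{12}^{(n)}\|_\infty + \|\Pi_K\|_\infty \|\hat{V}_{12}^{(n)} - V_{12}\|_\infty \\
&\qquad + \|V_1\|_\infty \|\Pi_K\|_\infty \|\hat{\Pi}_K^{(n)} - \Pi_K\|_\infty \|\hat{V}_{12}^{(n)}\|_\infty + \|V_1\|_\infty \|\Pi_K\|_\infty^2 \|\hat{V}_{12}^{(n)} - V_{12}\|_\infty \Big] \\
&\leq \|P_1\|_{\infty,\infty} \|A\|_\infty \Big[ \|\hat{\Pi}_K^{(n)} - \Pi_K\|_\infty \|\hat{V}_{12}^{(n)}\|_\infty + \|\Pi_K\|_\infty \|\hat{V}_{12}^{(n)} - V_{12}\|_\infty \\
&\qquad + \|V_1\|_\infty \|\Pi_K\|_\infty \|\hat{\Pi}_K^{(n)} - \Pi_K\|_\infty \|\hat{V}_{12}^{(n)}\|_\infty + \|V_1\|_\infty \|\Pi_K\|_\infty^2 \|\hat{V}_{12}^{(n)} - V_{12}\|_\infty \Big]
\end{aligned}$$

where $\|T\|_{\infty,\infty} := \sup_{A \in \mathcal{L}(\mathbb{R}^{p+q}) - \{0\}} \|T(A)\|_\infty / \|A\|_\infty$. Hence

$$\begin{aligned}
\| \hat{\Psi}_K^{(n)} - \Psi_K \|_{\infty,\infty}
\leq\; & \big[ 1 + \|V_1\|_\infty \|\Pi_K\|_\infty \big]\, \|\hat{V}_{12}^{(n)}\|_\infty\, \|\hat{\Pi}_K^{(n)} - \Pi_K\|_\infty\, \|P_1\|_{\infty,\infty} \\
& + \big[ 1 + \|V_1\|_\infty \|\Pi_K\|_\infty \big]\, \|\Pi_K\|_\infty\, \|\hat{V}_{12}^{(n)} - V_{12}\|_\infty\, \|P_1\|_{\infty,\infty}.
\end{aligned} \tag{13}$$

From the strong law of large numbers it is easily seen that $\hat{V}_1^{(n)}$ (resp. $\hat{V}_{12}^{(n)}$) converges almost surely, as
$n \to +\infty$, to $V_1$ (resp. $V_{12}$). Therefore, $\hat{\Pi}_K^{(n)}$ converges almost surely, as $n \to +\infty$, to $\Pi_K$, and from (13)
we deduce that $\hat{\Psi}_K^{(n)}$ converges almost surely, as $n \to +\infty$, to $\Psi_K$. It remains to obtain the asymptotic
distribution of $\hat{H}^{(n)}$. We have $\hat{H}^{(n)} = \hat{H}_1^{(n)} - \hat{H}_2^{(n)}$, where

$$\hat{H}_1^{(n)} = \sqrt{n} \Big( \frac{1}{n} \sum_{k=1}^{n} Z^{(k)} \otimes Z^{(k)} - V \Big) \quad \text{and} \quad \hat{H}_2^{(n)} = \frac{1}{\sqrt{n}} \big( \sqrt{n}\, \bar{Z}^{(n)} \big) \otimes \big( \sqrt{n}\, \bar{Z}^{(n)} \big).$$

The central limit theorem ensures that $\hat{H}_1^{(n)}$ (resp. $\sqrt{n}\, \bar{Z}^{(n)}$) converges in distribution, as $n \to +\infty$, to a
random variable $H$ (resp. $U$) having a centered normal distribution with covariance operator $\Gamma$ (resp. $\Gamma'$)
given by

$$\Gamma = \mathbb{E}\big( (Z \otimes Z - V) \,\widetilde{\otimes}\, (Z \otimes Z - V) \big) \quad \big(\text{resp. } \Gamma' = \mathbb{E}(Z \otimes Z)\big).$$

Hence, $\hat{H}_2^{(n)}$ converges in probability, as $n \to +\infty$, to 0, and Slutsky's theorem makes it possible to conclude that
$\hat{H}^{(n)}$ converges in distribution, as $n \to +\infty$, to $H$.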

5.3. Proof of Theorem 2 

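We just need to prove the lemma which is given below. Then the proof of Theorem 2 is similar to that of
Theorem 3.1 in [10]. Let $r \in \mathbb{N}^*$ and $(m_1, \dots, m_r) \in (\mathbb{N}^*)^r$ be such that $\sum_{\ell=1}^{r} m_\ell = p$ and

$$\xi_{K_{\sigma(1)}} = \cdots = \xi_{K_{\sigma(m_1)}} > \xi_{K_{\sigma(m_1+1)}} = \cdots = \xi_{K_{\sigma(m_1+m_2)}} > \cdots > \xi_{K_{\sigma(m_1 + m_2 + \cdots + m_{r-1} + 1)}} = \cdots = \xi_{K_{\sigma(m_1 + m_2 + \cdots + m_r)}}.$$

Then, putting $E = \{ \ell \in \mathbb{N}^* \;/\; 1 \leq \ell \leq r,\; m_\ell \geq 2 \}$ and $F_\ell := \big\{ \sum_{k=0}^{\ell-1} m_k + 1, \dots, \sum_{k=0}^{\ell} m_k - 1 \big\}$ with
$m_0 = 0$, we have: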

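Lemma 3. If $E \neq \emptyset$, then for all $\ell \in E$ and all $i \in F_\ell$, the sequence $n^{\alpha} \big( \hat{\xi}^{(n)}_{K_{\sigma(i)}} - \hat{\xi}^{(n)}_{K_{\sigma(i+1)}} \big)$ converges in
probability to 0 as $n \to +\infty$.

Proof. Let us put $\gamma_\ell = \xi_{K_{\sigma(i)}} = \xi_{K_{\sigma(i+1)}}$, and recall that $\xi_K = \|\delta_K\|$. If $\gamma_\ell = 0$, then $\delta_{K_{\sigma(i)}} = \delta_{K_{\sigma(i+1)}} = 0$ and, from Proposition 1,

$$\begin{aligned}
\Big| n^{\alpha} \big( \hat{\xi}^{(n)}_{K_{\sigma(i)}} - \hat{\xi}^{(n)}_{K_{\sigma(i+1)}} \big) \Big|
&= n^{\alpha - 1/2} \Big| \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i)}}(\hat{H}^{(n)}) \big\| - \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i+1)}}(\hat{H}^{(n)}) \big\| \Big| \\
&\leq n^{\alpha - 1/2} \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i)}}(\hat{H}^{(n)}) - \hat{\Psi}^{(n)}_{K_{\sigma(i+1)}}(\hat{H}^{(n)}) \big\| \\
&\leq n^{\alpha - 1/2} \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i)}} - \hat{\Psi}^{(n)}_{K_{\sigma(i+1)}} \big\|_\infty \, \| \hat{H}^{(n)} \|,
\end{aligned}$$

where, for an operator $T$ defined on $\mathcal{L}(\mathbb{R}^{p+q})$, $\|T\|_\infty$ denotes the operator norm associated with the norm $\|\cdot\|$.
Since $\hat{\Psi}^{(n)}_{K_{\sigma(i)}}$ and $\hat{\Psi}^{(n)}_{K_{\sigma(i+1)}}$ converge almost surely, as $n \to +\infty$, to $\Psi_{K_{\sigma(i)}}$ and $\Psi_{K_{\sigma(i+1)}}$ respectively, and
since $\hat{H}^{(n)}$ converges in distribution, as $n \to +\infty$, to $H$, it follows from the preceding inequality and from
$\alpha < 1/2$ that $n^{\alpha} \big( \hat{\xi}^{(n)}_{K_{\sigma(i)}} - \hat{\xi}^{(n)}_{K_{\sigma(i+1)}} \big)$ converges in probability to 0 as $n \to +\infty$. If $\gamma_\ell \neq 0$, we have

$$\begin{aligned}
n^{\alpha} \big( \hat{\xi}^{(n)}_{K_{\sigma(i)}} - \hat{\xi}^{(n)}_{K_{\sigma(i+1)}} \big)
&= \frac{ n^{\alpha} \Big( \big( \hat{\xi}^{(n)}_{K_{\sigma(i)}} \big)^2 - \big( \hat{\xi}^{(n)}_{K_{\sigma(i+1)}} \big)^2 \Big) }{ \hat{\xi}^{(n)}_{K_{\sigma(i)}} + \hat{\xi}^{(n)}_{K_{\sigma(i+1)}} } \\
&= \frac{ n^{\alpha - 1} \Big( \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i)}}(\hat{H}^{(n)}) \big\|^2 - \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i+1)}}(\hat{H}^{(n)}) \big\|^2 \Big) + 2 n^{\alpha - 1/2} \Big( \big\langle \delta_{K_{\sigma(i)}}, \hat{\Psi}^{(n)}_{K_{\sigma(i)}}(\hat{H}^{(n)}) \big\rangle - \big\langle \delta_{K_{\sigma(i+1)}}, \hat{\Psi}^{(n)}_{K_{\sigma(i+1)}}(\hat{H}^{(n)}) \big\rangle \Big) }{ \big\| n^{-1/2} \hat{\Psi}^{(n)}_{K_{\sigma(i)}}(\hat{H}^{(n)}) + \delta_{K_{\sigma(i)}} \big\| + \big\| n^{-1/2} \hat{\Psi}^{(n)}_{K_{\sigma(i+1)}}(\hat{H}^{(n)}) + \delta_{K_{\sigma(i+1)}} \big\| },
\end{aligned}$$

where $\langle \cdot, \cdot \rangle$ is the inner product defined by $\langle A, B \rangle = \mathrm{tr}(A^* B)$, and where the terms $\|\delta_{K_{\sigma(i)}}\|^2$ and $\|\delta_{K_{\sigma(i+1)}}\|^2$ cancel because both equal $\gamma_\ell^2$. First,

$$n^{\alpha - 1} \Big| \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i)}}(\hat{H}^{(n)}) \big\|^2 - \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i+1)}}(\hat{H}^{(n)}) \big\|^2 \Big| \leq n^{\alpha - 1} \Big( \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i)}} \big\|_\infty^2 + \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i+1)}} \big\|_\infty^2 \Big) \| \hat{H}^{(n)} \|^2 \tag{14}$$

and, further,

$$\begin{aligned}
& \Big| 2 n^{\alpha - 1/2} \Big( \big\langle \delta_{K_{\sigma(i)}}, \hat{\Psi}^{(n)}_{K_{\sigma(i)}}(\hat{H}^{(n)}) \big\rangle - \big\langle \delta_{K_{\sigma(i+1)}}, \hat{\Psi}^{(n)}_{K_{\sigma(i+1)}}(\hat{H}^{(n)}) \big\rangle \Big) \Big| \\
&\quad \leq 2 n^{\alpha - 1/2} \Big( \big| \big\langle \delta_{K_{\sigma(i)}}, \hat{\Psi}^{(n)}_{K_{\sigma(i)}}(\hat{H}^{(n)}) \big\rangle \big| + \big| \big\langle \delta_{K_{\sigma(i+1)}}, \hat{\Psi}^{(n)}_{K_{\sigma(i+1)}}(\hat{H}^{(n)}) \big\rangle \big| \Big) \\
&\quad \leq 2 n^{\alpha - 1/2} \Big( \| \delta_{K_{\sigma(i)}} \| \, \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i)}}(\hat{H}^{(n)}) \big\| + \| \delta_{K_{\sigma(i+1)}} \| \, \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i+1)}}(\hat{H}^{(n)}) \big\| \Big) \\
&\quad \leq 2 n^{\alpha - 1/2} \Big( \| \delta_{K_{\sigma(i)}} \| \, \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i)}} \big\|_\infty + \| \delta_{K_{\sigma(i+1)}} \| \, \big\| \hat{\Psi}^{(n)}_{K_{\sigma(i+1)}} \big\|_\infty \Big) \| \hat{H}^{(n)} \|.
\end{aligned} \tag{15}$$

Since $\alpha < 1/2$ and the denominator above converges in probability to $2 \gamma_\ell > 0$, Equations (14) and (15), and the convergence properties recalled above make it possible to conclude that the sequence
$n^{\alpha} \big( \hat{\xi}^{(n)}_{K_{\sigma(i)}} - \hat{\xi}^{(n)}_{K_{\sigma(i+1)}} \big)$ converges in probability to 0, as $n \to +\infty$. □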